[57] ABSTRACT

A method and apparatus for training a text classifier is disclosed. A supervised learning system and an annotation system are operated cooperatively to produce a classification vector which can be used to classify documents with respect to a defined class. The annotation system automatically annotates documents with a degree of relevance annotation to produce machine annotated data. The degree of relevance annotation represents the degree to which the document belongs to the defined class. This machine annotated data is used as input to the supervised learning system. In addition to the machine annotated data, the supervised learning system can also receive manually annotated data and/or a user request. The machine annotated data, along with the manually annotated data and/or the user request, are used by the supervised learning system to produce a classification vector. In one embodiment, the supervised learning system comprises a relevance feedback mechanism. The relevance feedback mechanism is operated cooperatively with the annotation system for multiple iterations until a classification vector of acceptable accuracy is produced. The classification vector produced by the invention is the result of a combination of supervised and unsupervised learning.

29 Claims, 3 Drawing Sheets 
METHOD AND APPARATUS FOR TRAINING A TEXT CLASSIFIER

FIELD OF THE INVENTION

The present invention relates generally to computerized text classification. More particularly, the present invention relates to the combined supervised and unsupervised learning of text classifiers.

BACKGROUND OF THE INVENTION 

The amount and variety of data stored on-line is growing at a rapid rate. This is particularly true for natural language text in its many forms (news articles, memos, electronic mail, repair reports, etc.). While there are many potential benefits of computer access to this data, they cannot be realized unless documents useful under particular circumstances can be distinguished from ones which are not useful.

An important technique in on-line text processing is text classification, which is the sorting of documents into meaningful groups. A variety of text classification systems are currently in use. Text retrieval systems attempt to separate documents from a text database into two groups: those which are relevant to a user query and those which are not. Text routing systems, or filtering systems, direct documents from an incoming stream of documents to relevant users. Text categorization systems sort documents into two or more designated classes. Text classification can be applied to documents which are purely textual, as well as documents which contain both text and other forms of data.

Text classification is sometimes done manually, by having human beings decide what group each document should go into. Such a technique is often too expensive to be practical, so machines for classifying text, and the methods of classification, have become of considerable interest. Such machines are generally programmed digital computers, and are called classifiers. Classifiers are of great importance to text processing.

For purposes of this discussion, consider a classifier which is programmed to distinguish between the class of relevant documents and the class of non-relevant documents. In order for such a classifier to be effective, it requires knowledge about the structure of relevant and non-relevant documents. Such knowledge can be manually programmed into a classifier, but this requires considerable time and expertise, given the complexity of language.

A variety of machine learning techniques have been developed to automatically train classifiers. The most common automated technique for training a classifier is called supervised learning. In such a system, the classifier is trained using documents which have been classified manually. Such manual classification requires a user to analyze a set of documents, and to decide which documents are relevant and which are not relevant. The user will then label the reviewed documents as relevant, or not relevant. These labels are called annotations. A document which has such a label assigned to it is called an annotated document. When the annotations are determined by a user, the documents are referred to as manually annotated documents.

Thus, in supervised learning, the classifier is provided with manually annotated documents which are used as training data. The classifier uses the training data to learn the structure of documents which fall within certain classes. For example, the classifier may employ a statistical procedure which will produce a statistical model of the structure of the relevant and non-relevant documents. This statistical model may then be used to classify documents which have not been annotated.

5,675,710

One supervised learning technique which has been widely applied to text classification is called relevance feedback. In relevance feedback, a user provides a request, which is a specification of the attributes which a document belonging to a class of interest is likely to have. The request typically contains words likely to occur in documents belonging to the class, but may also contain other identifiers such as subject categories, author names, publication dates, associated formatted data, etc. This request is then used as a query to retrieve documents from a document collection. The user may then review the retrieved documents and annotate (i.e. label) a subset of the documents as relevant to the request, and annotate another subset of the documents as not relevant to the request. The relevance feedback mechanism reformulates the query based upon these manually annotated documents. Terms or expressions which occur in the relevant documents are emphasized in the reformulated query. Similarly, terms or expressions which occur in the non-relevant documents are de-emphasized in the reformulated query. The effect of such a query reformulation is to move the query in the direction of the relevant documents and away from the non-relevant documents. An example of such a relevance feedback method is the Rocchio algorithm for relevance feedback. See, Donna Harman, "Relevance Feedback And Other Query Modification Techniques," in William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 241-263, Prentice Hall, Englewood Cliffs, N.J., 1992. The more documents which are reviewed and manually annotated by the user, the more accurate the resulting query. However, the act of manually annotating documents is time consuming and thus expensive.

Since supervised learning is expensive in terms of user effort, unsupervised learning has also been used to train classifiers. Unlike supervised learning methods, unsupervised learning does not require manually annotated training data. Instead, these methods attempt to detect patterns that are inherent in a body of text, and produce a statistical model of that inherent structure. Since the data used to train the classifier with these methods is not annotated, there is no user indication as to a particular desired structure.

The most common approach to applying unsupervised learning in text classification is to apply unsupervised learning to an entire document collection. This attempts to uncover a simpler structure of the entire collection of documents. Variations of this technique have been used. For example, unsupervised learning can be used to reveal the underlying structure of words or phrases in the documents. It may be used to reveal an underlying structure of the documents as a whole by grouping the documents into related clusters. Combinations of these techniques may also be used in which unsupervised learning is applied to both words and documents.

These existing techniques, which apply unsupervised learning to an entire document collection, have not provided much improvement to text retrieval effectiveness. One reason is that there are many different underlying structures that can be found in a body of text, and only some of those structures will be useful in any particular text classification task. Since there is no user indication as to the desired structure, it is unlikely that a useful structure will be found. It is unlikely that purely unsupervised learning will be effective in information retrieval in the near future.
There have been some attempts to combine supervised and unsupervised learning. In one such attempt, supervised learning is used to train a classifier to distinguish documents which belong to some defined class from documents which do not belong to the class. The trained classifier is then applied to a document collection in order to identify a subset of documents in the collection which are most likely to belong to the class. An unsupervised learning method is then applied to the identified subset to produce a model of the underlying structure of the subset. Finally, a second round of supervised learning is applied to this model. For example, the unsupervised learning method could form clusters of documents. The second round of supervised learning could then train a classifier to distinguish between clusters of documents rather than between individual documents.

This approach limits the set of documents to which unsupervised learning is applied, so that it is more likely that structures reflecting the desired distinction into class members will be found. This approach has yielded somewhat better results than unsupervised learning applied to the entire data collection, but is still imperfect. The model of the underlying structure, which is identified during unsupervised learning, is unpredictable and cannot be adjusted to suit the nature of a particular text collection and classification problem. Further, the process is complex, requiring at least two algorithms (one for supervised learning and one for unsupervised learning) and multiple processing phases.

Another attempt at combining supervised and unsupervised learning is to automatically annotate documents. This approach first trains a classifier with manually annotated documents, using a supervised learning system. The trained classifier is then applied to a document collection in order to identify a small subset of highly ranked documents. This small set of highly ranked documents is assumed to belong to the defined set and the documents in the set are then automatically annotated (i.e. labeled) by the system as belonging to the defined class. These documents which are automatically annotated by the system are called machine annotated documents. These machine annotated documents, along with the original manually annotated documents, are used as training data during a second iteration of supervised learning to re-train the classifier to produce an improved classifier.

This method has worked when a high quality request and a document collection rich in relevant documents combine to ensure that the documents ranked highly by the initial classifier have a very high probability of belonging to the class. Since it cannot be known in advance that this will be the case, assuming that the highly ranked documents belong to the class is an imperfect strategy.

In the last described method, a certain number of highly ranked documents are automatically annotated as being relevant. The remaining documents are not annotated at all. This approach is limited in that it only takes into consideration a small number of documents during the supervised learning phase. In addition, those documents which are machine annotated are annotated as entirely relevant. There is no mechanism to attach a weight representing the degree to which a document is relevant to the annotation.

SUMMARY OF THE INVENTION

The present invention provides a method and apparatus for training a classifier using a combination of supervised and unsupervised learning. It improves on the prior art techniques by allowing documents to be automatically annotated as being partially relevant and partially not relevant. This is achieved by automatically annotating documents with a degree of relevance. This technique allows the use of the entire document collection to train the classifier, with each document contributing to the newly produced classifier based upon its degree of relevance.

In accordance with one embodiment the invention calculates a degree of relevance for each non-manually annotated document in a collection and automatically annotates the documents with the degree of relevance, thus creating a set of machine annotated documents. These machine annotated documents are then used to train a classifier, using a supervised learning system. In addition, the supervised learning system can also use manually annotated documents and/or a user request in combination with the machine annotated data to improve the accuracy of the resulting classifier.

These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic of the components of a computer system which can be configured to implement the present invention.

FIG. 2 is a block diagram illustrating a system for training a classifier in accordance with the present invention.

FIG. 3 is a block diagram showing the components of an embodiment of the invention.

FIG. 4 is a flow diagram of the operation of the invention.

DETAILED DESCRIPTION

1. System Architecture

As used herein, the term computer includes any device or machine capable of accepting data, applying prescribed processes to the data, and supplying the results of the processes.

The functions of the present invention are preferably performed by a programmed digital computer of the type which is well known in the art, an example of which is shown in FIG. 1. FIG. 1 shows a computer system 100 which comprises a display monitor 102, a textual input device such as a computer keyboard 104, a graphical input device such as a mouse 106, a computer processor 108, a memory unit 110, and a non-volatile storage device such as a disk drive 120. The memory unit 110 includes a storage area 112 for the storage of, for example, computer program code, and a storage area 114 for the storage of data. The computer processor 108 is connected to the display monitor 102, the memory unit 110, the non-volatile storage device 120, the keyboard 104, and the mouse 106. The external storage device 120 may be used for the storage of data and computer program code. The computer processor 108 executes the computer program code which is stored in the memory unit 110 in storage area 112. During execution, the processor may access data in storage space 114 in the memory unit 110, and may access data stored in the non-volatile storage device 120. The computer system 100 may suitably be any one of the types which are well known in the art such as a mainframe computer, a minicomputer, a workstation, or a personal computer.

FIGS. 2 and 3 are block diagrams of components of the present invention, and will be discussed in further detail below. It will be understood by those skilled in the art that the components of the present invention as shown in FIGS. 2 and 3 may advantageously be implemented using an appropriately programmed digital computer as shown in FIG. 1.

2. Vector Space Model of Text Retrieval

For purposes of this description, it is assumed that the classifier being trained uses the vector space model of text
retrieval. The details of the vector space model of text retrieval are described in: Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Publishing, 1989.

2.1 Document Representation

In text retrieval systems which use the vector space model of text retrieval, documents are represented as vectors of numeric weights with one weight for each indexing term. Indexing terms are a defined set of terms which are used to describe the documents. An example will illustrate this. Consider a vector space representation which uses 5 indexing terms to describe documents: T1, T2, T3, T4 and T5. Each document in the collection may be represented by a vector containing 5 numeric weights. Each weight is associated with one of the indexing terms, and represents the weight of the associated term in the document. For example, consider a system in which the indexing terms are as follows:

T1: DOCUMENT
T2: TEXT
T3: QUERY
T4: DOG
T5: SUN

Consider the following document D.

D={Text retrieval systems attempt to separate documents from a text database into two groups: those which are relevant to a user query and those which are not}

The vector which represents this document may be <1, 2, 1, 0, 0>. The first weight represents the weight of the T1 term, the second weight represents the weight of the T2 term, the third weight represents the weight of the T3 term, the fourth weight represents the weight of the T4 term, and the fifth weight represents the weight of the T5 term. Note that the particular values in the vector representing the document may vary depending on the particular vector space weighting formula being used.

More generally, if d indexing terms are being used to represent documents, then the representation of the ith document in the document database 230 would be:

$$i, \langle w_{i1} \ldots w_{ik} \ldots w_{id} \rangle$$

where the identifier i indicates this is the ith document and where $w_{ik}$ is the weight of the kth term in the ith document. Methods for computing these weights from the raw text of documents are disclosed in the Salton reference cited above describing the vector space model. The indexing terms may be words, as in the example above, or may be other content identifiers, such as citations, author names, publication dates, formatted data, etc. as disclosed in Edward Fox, Gary Nunn, and Whay Lee, "Coefficients for Combining Concept Classes in a Collection," in 11th International Conference on Research and Development in Information Retrieval, pp. 291-307, Grenoble, France, Jun. 13-15, 1988.

2.2 Representation of Classification Vector

A classifier based upon the vector space model of text retrieval uses a classification vector, c, to classify documents. The classification vector, c, is represented in a manner similar to documents. The classification vector c is a vector of numeric weights with one weight for each indexing term:

$$c = \langle w_{c1} \ldots w_{ck} \ldots w_{cd} \rangle$$

In this representation $w_{ck}$ represents the weight of the kth term in the classifier vector c. The c subscript identifies the weight term as a weight in the classification vector.

2.3 Document Classification

The classification vector c is used to rank the documents in a collection as follows. The classification vector c is applied to a document to calculate a retrieval status value (RSV) for the document. The RSV of a document is computed according to the following equation:

$$RSV_i = \sum_{k=1}^{d} w_{ck} \cdot w_{ik}$$

Thus, each weight term in the classification vector is multiplied by the corresponding weight term in the document vector. The sum of these multiplied weights gives a retrieval status value RSV, which represents the rank of the classified document. The higher the RSV, the more likely that the document falls within the class of documents represented by the classification vector c.

3. Invention Overview

A block diagram illustrating a system for training a classifier in accordance with the present invention is shown in FIG. 2. FIG. 2 shows an overview of the components of the system. The detailed operation of the system is described in further detail below. A system 200 for training a classifier includes a supervised learning system 210, an automatic annotation system 220, and a document database 230. The supervised learning system 210 initially receives a user request and/or manually annotated training data, which define a class of documents of interest to a user, and which are used to produce a classification vector c. This classification vector c can be used to classify documents in the database 230 with respect to a class of interest. The remainder of this detailed description will generally be directed to two classes of documents: those relevant to a particular user and those not relevant to that user, although the present invention is not limited to such classes.

The classification vector c produced by the supervised learning system 210 is input to the annotation system 220. The annotation system 220 classifies the documents in the database 230 using the classification vector c and automatically annotates the documents to produce machine annotated data. The machine annotated data produced by the annotation system 220 is used as input to the supervised learning system 210 during subsequent iterations in order to produce a classification vector c based upon both 1) the machine annotated data, and 2) the manually annotated data and/or user request. This procedure continues until a classification vector c of acceptable accuracy is produced. Thus, the supervised learning system 210 is capable of receiving machine annotated data from the annotation system 220 as well as manually annotated data from a user. Such a configuration provides for a system 200 which performs a combination of supervised (from a user request and/or manually annotated data) and unsupervised (from machine annotated data) learning.

An embodiment of the present invention is shown as the system 300 in FIG. 3. In the embodiment shown in FIG. 3, the supervised learning system 210 includes a relevance feedback module 310 and a logistic regression module 312. The annotation system 220 includes an RSV formatting module 318, a search engine 324, an initial probability annotator 328, and an iteration probability annotator 322. These components will be described in further detail below.

4. Operation of One Embodiment

The operation of the embodiment shown in FIG. 3 is described in conjunction with the flow diagram of FIG. 4.
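Before turning to the individual steps, the vector space machinery of sections 2.1 through 2.3 can be made concrete with a short sketch. It is illustrative only and not part of the disclosed apparatus. The function names are hypothetical, the third indexing term is assumed to be QUERY, and raw term counts with prefix matching (a crude substitute for stemming) are assumed as the weighting formula, which the description leaves open.

```python
import re
from typing import Dict, List, Tuple

# Indexing terms T1..T5 of the example; T3 = QUERY is an assumption.
INDEXING_TERMS = ["document", "text", "query", "dog", "sun"]


def term_frequency_vector(text: str, terms: List[str]) -> List[float]:
    """Weight each indexing term by the number of tokens that begin with it.

    Prefix matching stands in for stemming so that "documents" counts
    toward DOCUMENT. This weighting is an assumption of the sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(sum(1 for tok in tokens if tok.startswith(term)))
            for term in terms]


def rsv(classification_vector: List[float],
        document_vector: List[float]) -> float:
    """Retrieval status value: sum of products of corresponding weights."""
    if len(classification_vector) != len(document_vector):
        raise ValueError("vectors need one weight per indexing term")
    return sum(c * w for c, w in zip(classification_vector, document_vector))


def rank_documents(classification_vector: List[float],
                   documents: Dict[str, List[float]]
                   ) -> List[Tuple[str, float]]:
    """Return (identifier, RSV) pairs sorted from highest to lowest RSV."""
    scored = [(doc_id, rsv(classification_vector, vec))
              for doc_id, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


D = ("Text retrieval systems attempt to separate documents from a text "
     "database into two groups: those which are relevant to a user query "
     "and those which are not")
d_vector = term_frequency_vector(D, INDEXING_TERMS)

# Hypothetical classification vector and two further made-up documents.
c = [1.0, 1.0, 2.0, 0.0, 0.0]
collection = {"D": d_vector,
              "E": [0.0, 0.0, 0.0, 2.0, 1.0],
              "F": [2.0, 0.0, 0.0, 0.0, 0.0]}
ranking = rank_documents(c, collection)
```

Under these assumptions the example document D maps to the vector <1, 2, 1, 0, 0>, and the ranking places D first because its weights overlap most with the classification vector.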
4.1 Initialization and User Input

Initialization and user input to the system 300 occur in step 406. User input to the system 300 is represented as computer input/output device 302. Four kinds of inputs may be provided by the user: manually annotated documents, a user request, weighting factors, and an estimate of the number of relevant documents (each of these inputs is described in further detail in sections 4.1.1-4.1.4 below). The first two inputs are required while the second two inputs are optional.

4.1.1 Manually Annotated Documents

A user may manually annotate a number of documents from the document database 230 as either belonging to the set of relevant documents or not belonging to the set of relevant documents (i.e. relevant or not relevant). Thus, it is assumed that, prior to operation of the system 300, the user has reviewed at least some of the documents in the database 230 and has made a determination as to the relevance of at least one of the documents. The documents to be annotated may have been found as the result of a prior query in which a text retrieval system returned documents, by browsing the documents, or by other means. Let $R_c$ represent a set of identifiers for documents that have been manually annotated as being relevant, and $|R_c|$ represent the number of such documents. Similarly, $\bar{R}_c$ represents a set of identifiers for documents that have been manually annotated as being not relevant, and $|\bar{R}_c|$ represents the number of such documents. Thus $|R_c| + |\bar{R}_c|$ is the number of documents which have been manually annotated. Note that from document database 230 and sets $R_c$ and $\bar{R}_c$, the set of documents that have not been manually annotated can be determined. By comparing the sets $R_c$ and $\bar{R}_c$ with the document database 230, the set U can be determined, where U represents a set of identifiers of documents which are in the database 230 but which have not been manually annotated. $|U|$ represents the number of such documents. These documents identified by the set U are also referred to as non-manually-annotated documents.

4.1.2 User Request

User input 302 also consists of a request T which specifies words and possibly other attributes the user believes are likely to occur in the relevant documents. This request T is provided to a request processor 306, which converts the request T into a query Q, which is a vector of numeric weights in accordance with the vector space model of text retrieval. A query vector is represented as:

$$Q = \langle w_{r1} \ldots w_{rk} \ldots w_{rd} \rangle$$

$w_{rk}$ represents the weight of the kth term in the query vector Q. The r subscript identifies the weight term as a weight in the query vector. Methods for converting a textual request T into a query Q are the same as for converting a document into a vector of numeric weights, and are well known in the art as described above.

4.1.3 Weighting Factors

In step 410 five weighting factors are computed: $\delta_1$, $\delta_2$, $\alpha$, $\beta$ and $\gamma$. The first two weighting factors $\delta_1$ and $\delta_2$ control how much weight is given to the machine annotated data (described in detail below) relative to the weight given to the manually annotated data. A weighting calculator 304 receives $R_c$ and $\bar{R}_c$ from user input 302 and calculates the weighting factors $\delta_1$ and $\delta_2$, where $\delta_1$ is the weighting factor used during the relevance feedback phase, and $\delta_2$ is the weighting factor used during the logistic regression phase, both of which will be described in further detail below. Both $\delta_1$ and $\delta_2$ are set to

$$\delta_1 = \delta_2 = \frac{|R_c| + |\bar{R}_c| + |Q|}{|U|}$$

where $|Q|$ is 1 if the user entered a request T and 0 if the user did not enter a request T. (The situation in which a user does not enter a request T is discussed in further detail below.) This makes the total impact of the machine annotated data roughly equal to the impact of the manually annotated data.

The factors $\alpha$, $\beta$, and $\gamma$ are also set by the weighting calculator 304. $\alpha$ controls how much weight the initial request T has during formation of the classification vector c as discussed in further detail in section 4.2.1 below. $\beta$ and $\gamma$ control how much weight to give relevant and non-relevant documents, respectively, during classification vector formation. These relevant and non-relevant documents may be documents that were manually annotated, or documents which were machine annotated. Reasonable values for these parameters are $\alpha = 8$, $\beta = 16$, and $\gamma = 4$, based on the discussion of the setting of these parameters in Chris Buckley, Gerard Salton, and James Allan, "The Effect Of Adding Relevance Information In A Relevance Feedback Environment", in W. Bruce Croft and C. J. van Rijsbergen, editors, SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 292-300, London, 1994, Springer-Verlag.

For any particular application, tuning of these five factors may lead to better results, and thus, in an alternate embodiment, a mechanism may be provided to allow the user to bypass the weighting calculator 304 and manually enter the values of $\delta_1$, $\delta_2$, $\alpha$, $\beta$, and $\gamma$.

4.1.4 Initial Annotation of Documents with Probability Estimates

For each document i in the database 230, an initial probability annotator 328 annotates the document with an initial degree of relevance estimate which represents the degree to which document i belongs to the class of relevant documents. In one embodiment, this degree of relevance is the probability, $P_i$, that the document belongs to the class of relevant documents. The initial $P_i$ is computed in step 415 as follows.

For each document identified by the set $R_c$, the set of documents manually annotated as relevant, $P_i$ is set to 1.0. Thus, a manual annotation of relevance is represented as a $P_i$ value of 1.0. For each document identified by the set $\bar{R}_c$, the set of documents manually annotated as not relevant, $P_i$ is set to 0.0. Thus, a manual annotation of nonrelevance is represented as a $P_i$ value of 0.0. These probabilities which are determined by manual annotation of the documents will not change during processing since the user has made the determination of relevance or non-relevance.

For the remaining non-manually-annotated documents identified by the set U, an initial machine annotation with probability estimates is made as follows. The user estimates $n_u$, the number of documents in the set U of non-manually annotated documents which belong to the class of relevant documents, where $0 < n_u < |U|$. If the user has no such estimate, then a value of $n_u = 1$ can be used. Each document i identified by the set U is annotated with an initial $P_i$ where
$$P_i = \frac{n_u}{|U|}$$

This captures the user's estimate as to the number of relevant documents in the non-manually annotated data, as well as representing the initial ignorance as to which of these documents are relevant. These automatic initial annotations will change during processing.
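The initial annotation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the annotator 328 itself: the function name and argument names are hypothetical, and each non-manually-annotated document is assumed to receive the uniform estimate n_u divided by the size of U, which is the reading consistent with the description above.

```python
from typing import Dict, List, Set


def initial_probabilities(all_ids: List[str],
                          relevant: Set[str],
                          nonrelevant: Set[str],
                          n_u: float = 1.0) -> Dict[str, float]:
    """Assign an initial probability of relevance to every document.

    Manually annotated relevant documents get 1.0, manually annotated
    nonrelevant documents get 0.0, and every other document gets
    n_u / |U| (assumed uniform estimate).
    """
    if relevant & nonrelevant:
        raise ValueError("a document cannot carry both manual annotations")
    # U: documents in the database with no manual annotation.
    unannotated = [d for d in all_ids
                   if d not in relevant and d not in nonrelevant]
    if unannotated and not 0 < n_u < len(unannotated):
        raise ValueError("n_u must satisfy 0 < n_u < |U|")
    p_u = n_u / len(unannotated) if unannotated else 0.0

    probabilities = {}
    for d in all_ids:
        if d in relevant:
            probabilities[d] = 1.0
        elif d in nonrelevant:
            probabilities[d] = 0.0
        else:
            probabilities[d] = p_u
    return probabilities


# Hypothetical six-document database with two manual annotations.
ids = ["d1", "d2", "d3", "d4", "d5", "d6"]
probs = initial_probabilities(ids, relevant={"d1"}, nonrelevant={"d2"},
                              n_u=2)
```

With four non-manually-annotated documents and an estimate of two relevant ones among them, each of those documents starts at 0.5, so the probabilities over U sum to the user's estimate.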

4.2 Classification Vector Formation

The algorithm for producing a classification vector c in one embodiment is based on the Rocchio algorithm for relevance feedback, which is well known in the art and is disclosed in Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Publishing, 1989; and in Harman, Chapter 11 of Information Retrieval: Data Structures and Algorithms, Prentice-Hall Publishing, 1992. The algorithm for classification vector formation is as follows.

4.2.1 Construction of Classification Vector by Relevance Feedback Module

The classification vector c is constructed by the relevance feedback module 310 in step 430. The relevance feedback module 310 has the following inputs:

a. The weighting factors $\delta_1$, $\alpha$, $\beta$, and $\gamma$ from the weighting calculator 304.

b. The query vector $Q = \langle w_{r1} \ldots w_{rk} \ldots w_{rd} \rangle$ from the request processor 306.

c. The document vectors from the database 230 where each document i is represented by the identifier i and a vector, $\langle w_{i1} \ldots w_{ik} \ldots w_{id} \rangle$.

d. The set $R_c$ specifying the identifiers of documents which have been manually annotated as being relevant, and the set $\bar{R}_c$ specifying the identifiers of documents which have been manually annotated as being nonrelevant, from user input 302.

e. For each non-manually annotated document identified by the set U, the probability of relevance $P_i$ of the document from the initial probability annotator 328 during the initial iteration, and from the iteration probability annotator 322 (described in further detail below) during subsequent iterations.

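The constructed classification vector c is a vector

$$c = \langle w_{c1} \ldots w_{ck} \ldots w_{cd} \rangle$$

where

$$w_{ck} = \begin{cases} w'_{ck} & \text{if } w'_{ck} > 0 \\ 0 & \text{otherwise} \end{cases}$$

according to the equation:

$$w'_{ck} = \alpha w_{rk} + \beta \frac{1}{|R_c|} \sum_{i \in R_c} w_{ik} - \gamma \frac{1}{|\bar{R}_c|} \sum_{i \in \bar{R}_c} w_{ik} + \delta_1 \beta \frac{1}{n_{uc}} \sum_{i \in U} P_i w_{ik} - \delta_1 \gamma \frac{1}{|U| - n_{uc}} \sum_{i \in U} (1 - P_i) w_{ik}$$

where $n_{uc} = \sum_{i \in U} P_i$.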

The first three elements of the right side of the above 
equation are the same as the elements of the Rocdiio 
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formula for classification vector construction in relevance 
feedback. The first element aw^ increases the weight of the 
kth indexing tern in ttit classification vector c in proportion 
to the wdglit the term has in the query vector. The factor a 
5 controls how much influence the weight of the term in the 
quay has on the final wei^t of the term in the classification 
vector c. 
The second element

$$\beta \frac{1}{|R_c|} \sum_{i \in R_c} w_{ik}$$

increases the weight of the kth indexing term in the
classification vector c in proportion to the average weight of the
term in the documents which were manually annotated as
relevant. First, the sum of the weights of the kth term across
the set of manually annotated relevant documents ($R_c$) is
calculated. Next, the average weight of the kth term in the
manually annotated relevant documents is calculated by
multiplying by

$$\frac{1}{|R_c|}.$$

Finally, the average weight is multiplied by $\beta$, which
controls how much impact the average weight of the term in the
manually annotated relevant documents has on the weight of
the term in the classification vector.
The third element

$$\gamma \frac{1}{|R_{\bar{c}}|} \sum_{i \in R_{\bar{c}}} w_{ik}$$

decreases the weight of the kth indexing term in the
classification vector c in proportion to the average weight of the
term in the documents which were manually annotated as
not relevant. First, the sum of the weights of the kth term
across the set of manually annotated nonrelevant documents
($R_{\bar{c}}$) is calculated. Next, the average weight of the kth term
in the manually annotated nonrelevant documents is
calculated by multiplying by

$$\frac{1}{|R_{\bar{c}}|}.$$

Finally, the average weight is multiplied by $\gamma$, which controls
how much impact the average weight of the term in the
manually annotated nonrelevant documents has on the
weight of the term in the classification vector.
The last two elements

$$\delta_1 \frac{1}{n_{uc}} \sum_{i \in U} P_i w_{ik}$$

and

$$\delta_1 \frac{1}{|U| - n_{uc}} \sum_{i \in U} (1 - P_i) w_{ik}$$

modify the Rocchio formula by taking into account, for each
non-manually-annotated document identified by the set U,
the degree to which it is believed that it is relevant ($P_i$) and
the degree to which it is believed to be nonrelevant ($1 - P_i$).
The factor $\delta_1$ controls how much weight is given to the
non-manually-annotated documents. The factors

$$\frac{1}{n_{uc}} \quad \text{and} \quad \frac{1}{|U| - n_{uc}}$$

play the same role as

$$\frac{1}{|R_c|} \quad \text{and} \quad \frac{1}{|R_{\bar{c}}|}$$

in the second and third terms, except that their denominators
$n_{uc}$ and $|U| - n_{uc}$ are estimates of the number of relevant and
non-relevant documents, respectively, identified by the set
U. Thus, each of the machine annotated documents is treated
as partially relevant and partially non-relevant, according to
its $P_i$ annotation. The fourth term in the equation increases
the weight of the kth indexing term in the classification
vector c according to the proportional average weight of the
term in the machine annotated documents, where the
proportion is defined by $P_i$. Similarly, the fifth term in the
equation decreases the weight of the kth indexing term in the
classification vector c according to the proportional average
weight of the term in the machine annotated documents,
where the proportion is defined by ($1 - P_i$).

Methods for implementing the Rocchio formula are well
known in the art, and described in the Salton and Harman
references above. It would be readily apparent to one skilled
in the art of text retrieval to implement the above described
modifications of the Rocchio formula using similar methods.

Thus, the output of the relevance feedback module 310 is
a classification vector c, which is formed by the above
equation.

4.3 Operation of the Annotation System 220

The annotation system 220 is then operated to modify the
initial $P_i$ annotations which were assigned to the
non-manually-annotated documents by the initial probability
annotator 328. As discussed above, the annotation system
220 includes the RSV formatting module 318, the search
engine 324, the initial probability annotator 328, and the
iteration probability annotator 322.

The first phase of the annotation process begins by
producing a retrieval status value (RSV) for each document
in the database. Then an RSV is produced for the query Q.
These RSVs are then used to produce formatted training
data.

The second phase of the annotation process passes the
formatted data from phase 1 to a logistic regression module
312, producing the parameters of a logistic function.

The third phase of the annotation process applies the
logistic function to the RSVs of the non-manually
annotated documents, producing a new $P_i$ annotation for each
non-manually annotated document.

4.3.1 Operation of Search Engine to Produce
Document Retrieval Status Values

The classification vector $c = \langle w_{c1}, \ldots, w_{ck}, \ldots, w_{cd} \rangle$
produced by the relevance feedback module 310, along with
the document vectors $i = \langle w_{i1}, \ldots, w_{ik}, \ldots, w_{id} \rangle$ from the
document database 230, are provided to the search engine
324, which performs the classification function in step 435.
The classification vector c is applied to all of the documents,
both manually annotated and non-manually annotated, in
order to produce a retrieval status value (RSV) for each such
document. Thus, an RSV is calculated for each document i,
as described above in Section 2.3. A large RSV indicates that
the document is more likely to be relevant. Known vector
space text retrieval systems include mechanisms for
efficiently computing the RSVs for large numbers of
documents.


4.3.2 Operation of RSV Formatting Module to
Create Logistic Training Data

The search engine 324 provides the calculated RSVs to
the RSV formatting module 318. The RSV formatting
module 318 also receives the parameter $\delta_2$ from the weighting
calculator 304, and the sets of identifiers of manually
annotated documents, $R_c$ and $R_{\bar{c}}$, from user input 302. The
RSV formatting module 318 also receives, on the first
iteration, the initial probability estimates $P_i$ from initial
probability annotator 328. On subsequent iterations, it uses
the $P_i$ values computed by the iteration probability annotator
322. In step 445, the RSV formatting module 318 creates
logistic training data consisting of triples of numbers as
follows.

First, a triple for the query Q is created as:

$\langle RSV_r, 1, 1.0 \rangle$

where $RSV_r$ is set to the maximum of all document RSVs
received from the search engine 324. The second element of
the triple, the integer 1, indicates that the query is considered
to belong to the class of relevant documents, and the third
element of the triple, the value 1.0, indicates that this triple
is to be given a weight of 1 during logistic regression
(described in further detail below).

Second, for each document identified by the set $R_c$ (i.e.,
those which have been manually annotated as belonging to
the class of relevant documents), a triple is created as:

$\langle RSV_i, 1, 1.0 \rangle$

where $RSV_i$ is the RSV of the manually annotated document
as calculated by the search engine 324. The second element
of the triple, the integer 1, indicates that the document is
considered to belong to the class of relevant documents, and
the third element of the triple, the value 1.0, indicates that
this triple is to be given a weight of 1 during logistic
regression.

Third, for each document identified by the set $R_{\bar{c}}$ (i.e.,
those that have been manually annotated as not belonging to
the class of relevant documents), a triple is created as:

$\langle RSV_i, 0, 1.0 \rangle$

where $RSV_i$ is the RSV of the manually annotated document
as calculated by the search engine 324. The second element
of the triple, the integer 0, indicates that the document is
considered to not belong to the class of relevant documents,
and the third element of the triple, the value 1.0, indicates
that this triple is to be given a weight of 1 during logistic
regression.

Fourth, for each non-manually annotated document
identified by the set U, two triples are created:

$\langle RSV_i, 1, P_i \times \delta_2 \rangle$ (1)

$\langle RSV_i, 0, (1 - P_i) \times \delta_2 \rangle$ (2)
In the triple (1) above, $RSV_i$ is the RSV of the document as
calculated by the search engine 324. The second element of
the triple, the integer 1, indicates that the document is
considered to belong to the class of relevant documents.
During the first iteration, the factor $P_i$ is the probability of
relevance calculated by the initial probability annotator 328,
and is used as a weighting factor such that the weight given
to document i, when treating it as a relevant document, will
be proportional to the current estimate of the probability that
the document is relevant. During subsequent iterations, the
factor $P_i$ is the probability of relevance calculated by the
iteration probability annotator 322. The $\delta_2$ parameter
controls how much weight the logistic training data representing
the non-manually annotated documents is given during
logistic regression (as described below). In the triple (2)
above, $RSV_i$ is the RSV of the machine annotated document
as calculated by the search engine 324. The second element
of the triple, the integer 0, indicates that the document is
considered not to belong to the class of relevant documents.
The factor $1 - P_i$ is used as a weighting factor such that the
weight given to document i, when treating it as a
non-relevant document, will be proportional to the current
estimate of the probability that the document is a member of the
class of non-relevant documents. $\delta_2$ serves the same function
as in triple (1).

4.4 Construction of the Logistic Parameters

The logistic training data created by the RSV formatting
module 318 is provided to a logistic regression module 312.
The logistic regression module 312 calculates parameters $a_c$
and $b_c$ from the received logistic training data in step 450.
These parameters $a_c$ and $b_c$ will be used as input to the
iteration probability annotator 322 (described below in
conjunction with step 460). The logistic regression module 312
will choose parameters $a_c$ and $b_c$ such that the iteration
probability annotator 322, when provided with the RSVs of
the non-manually annotated documents, will calculate
estimates of $P_i$ for the non-manually annotated documents.
Techniques for performing logistic regression are well
known in the art. For details, see A. Agresti, "Categorical
Data Analysis," John Wiley, New York, 1990, and P.
McCullagh and J. Nelder, "Generalized Linear Models," Chapman
& Hall, London, 2nd edition, 1989. Logistic regression is
also a capability in commercially available statistics
packages such as SPSS from SPSS, Inc. of Chicago, Ill.

4.5 Test for Convergence

The classification vector c produced by the relevance
feedback module 310, along with the parameters $a_c$ and $b_c$
calculated by the logistic regression module 312, are
provided to a convergence test module 314. In step 455 the
convergence test module 314 tests for a termination
condition to determine if a satisfactory classification vector c has
been produced. This test will only be made during the
second, and each subsequent, time that step 455 is reached.
Different termination conditions may be used depending
upon the specific application and user requirements. For
example, the classification vector c, and the parameters $a_c$
and $b_c$ found on the particular iteration may be compared
with the values from the prior iteration. If the values are
sufficiently close, then the termination condition is satisfied.
Alternatively, the procedure can be executed for some fixed
number of iterations. When the chosen termination
condition is reached, the result is the classification vector c, as
represented by 326 in FIG. 3 and 465 in FIG. 4. At this point,
the classification vector c may be used, for example, to
categorize or retrieve documents from a database of
documents. If the chosen termination condition is not satisfied,
then the procedure continues with step 460.

4.6 Re-Estimation of Probability of Relevance for
Non-Manually-Annotated Documents

Continuing with Step 460, the parameters $a_c$ and $b_c$
computed by the logistic regression module 312, and the
RSV values of the non-manually annotated documents from
the search engine 324, are provided to the iteration
probability annotator 322. The iteration probability annotator
322 annotates each of the non-manually-annotated
documents with a new estimated value of $P_i$ according to the
following formula:

$$P_i = \frac{e^{a_c + b_c RSV_i}}{1 + e^{a_c + b_c RSV_i}}$$

Steps 430 through 460 are then repeated as follows.

4.7 Operation of the Invention for Multiple
Iterations

When control reaches step 430 on the second and each
successive pass, the relevance feedback module 310 uses
the new estimated values of $P_i$ received from the iteration
probability annotator 322 instead of the initial estimated
values of $P_i$ calculated by the initial probability annotator
328, and uses these new estimated values of $P_i$ when
forming the classification vector c. This is the case for each
successive time step 430 is reached.

Similarly, when control reaches step 445 on the second
and each successive pass, the RSV formatting module 318
uses the new estimated values of $P_i$ received from the
iteration probability annotator 322 instead of the initial
estimated values of $P_i$ calculated in the initial probability
annotator 328, when creating the logistic training data. This
is the case for each successive time step 445 is reached.

4.8 Alternate Embodiment Varying User Inputs

In the embodiment described above, it was assumed that
the user provided a request T, documents manually
annotated as relevant, and documents manually annotated as not
relevant. In an alternate embodiment, the system 300 can
operate to produce a classification vector c in the absence of
either a user entered request or manually annotated
documents. If no user request is entered, the first term of the
equation described in Section 4.2.1 may be dropped. If no
documents have been annotated as being class members, the
second term of the equation as described in Section 4.2.1
may be dropped. If no documents have been annotated as not
being class members, the third term of the equation
described in Section 4.2.1 may be dropped. However, for the
resulting classification vector c to have reasonable
effectiveness, either a user request or at least one document
manually annotated as being a class member is required.
Such a modification to the preferred embodiment could be
readily implemented by one skilled in the art given the above
disclosure.

5. Conclusion

The foregoing Detailed Description is to be understood as
being in every respect illustrative and exemplary, but not
restrictive, and the scope of the invention disclosed herein is
not to be determined from the Detailed Description, but
rather from the claims as interpreted according to the full
breadth permitted by the patent laws. It is to be understood
that the embodiments shown and described herein are only
illustrative of the principles of the present invention and that
various modifications may be implemented by those skilled
in the art without departing from the scope and spirit of the
invention. For example, the invention has been
described using the two classes relevant and non-relevant.
However, the present invention is not limited to these two
classes. In addition, the invention may be readily extended
to apply to problems involving more than two classes, for
instance by treating a problem involving n classes as n
binary classification problems. Further, an embodiment of
the invention may use a model of text retrieval other than the
vector space model, and a supervised learning method other
than the Rocchio algorithm.

I claim:

1. A method for training a classifier to classify at least one
document which has not been manually annotated with
respect to a defined class;

performing an operation on data including a retrieval
status value associated with the document to generate at
least one parameter value;

calculating a degree of relevance representing the degree
to which said document belongs to said defined class,
said degree of relevance being a function of at least the
retrieval status value and the parameter value; and

training said classifier using said degree of relevance.

2. The method of claim 1 further comprising the steps of:

automatically annotating said at least one document with
said calculated degree of relevance to produce at least
one automatically annotated document; and

wherein said step of training said classifier further
comprises the step of training said classifier using said at
least one automatically annotated document.

3. The method of claim 2 wherein said step of training
said classifier using said at least one automatically annotated
document further comprises the step of:

performing a relevance feedback function using said at
least one automatically annotated document to produce
a classification vector.

4. The method of claim 1 wherein said defined class is
defined by a request.

5. The method of claim 1 wherein said defined class is
defined by at least one manually annotated document.

6. The method of claim 1 wherein said defined class is
defined by both a request and at least one manually
annotated document.

7. The method of claim 1 wherein said degree of relevance
is proportional to an estimate of the probability of said at
least one document in said first set being relevant to said
defined class.

8. The method of claim 1 wherein the operation includes
logistic regression.

9. The method of claim 1 wherein the data includes
logistic training data.

10. The method of claim 1 wherein the parameter value
includes a value of a logistic parameter.

11. A method for producing a classification vector for use
in classifying at least one non-manually annotated document
with respect to a defined class, said method comprising the
steps of:

performing an operation on data including a retrieval
status value associated with the document to generate at
least one parameter value;

calculating a degree of relevance representing the degree
to which said non-manually annotated document
belongs to the defined class, said degree of relevance
being a function of at least the retrieval status value and
the parameter value;

automatically annotating said non-manually annotated
document with said degree of relevance to produce a
machine annotated document; and

performing a relevance feedback function using said
machine annotated document.

12. The method of claim 11 wherein said step of
calculating a degree of relevance further comprises the step of:

calculating an estimate of the probability that said
non-manually annotated document belongs to the defined
class.

13. The method of claim 11 wherein said step of
performing a relevance feedback function further comprises the step
of:

performing the relevance feedback function using a
manually annotated document.

14. The method of claim 11 wherein the operation
includes logistic regression.

15. The method of claim 11 wherein the data includes
logistic training data.

16. The method of claim 11 wherein the parameter value
includes a value of a logistic parameter.

17. An apparatus for training a classifier to classify at least
one document which has not been manually annotated with
respect to a defined class, said apparatus comprising:

a [illegible] for performing an operation on
data including a retrieval status value associated with
the document to generate at least one parameter value;

an annotation processor for automatically annotating the
document to produce at least one automatically
annotated document, said annotation including a degree of
relevance representing the degree to which said at least
one automatically annotated document belongs to said
defined class, said degree of relevance being a function
of at least the retrieval status value and the parameter
value; and

a supervised learning processor for training the classifier
using the automatically annotated document.

18. [illegible] a request.

19. The system of claim 17 wherein said supervised
learning system comprises:

a relevance feedback module.

20. The system of claim 17 wherein said annotation
system further comprises:

classification means for calculating a retrieval status value
for said at least one document; and

means responsive to said classification means for
calculating said degree of relevance for said at least one
document.

21. The apparatus of claim 17 wherein the operation
includes logistic regression.

22. The apparatus of claim 17 wherein the data includes
logistic training data.

23. The apparatus of claim 17 wherein the parameter
value includes a value of a logistic parameter.

24. An apparatus for training a classifier to classify at least
one document which has not been manually annotated with
respect to a defined class, said apparatus comprising:

means for performing an operation on data including a
retrieval status value associated with the document to
generate at least one parameter value;

means for calculating a degree of relevance representing
the degree to which the document belongs to said
defined class, said degree of relevance being a function
of at least the retrieval status value and the parameter
value; and
means for training the classifier using said degree of
relevance.

25. The apparatus of claim 24 further comprising:

annotation means for automatically annotating said at
least one document with said calculated degree of
relevance to produce at least one automatically
annotated document; and

wherein said means for training the classifier comprises a
relevance feedback mechanism responsive to said
annotation means for producing a classification vector.
26. The apparatus of claim 24 wherein said degree of
relevance is proportional to an estimate of the probability of
said at least one document being relevant to said defined
class.

27. The apparatus of claim 24 wherein the operation
includes logistic regression.

28. The apparatus of claim 24 wherein the data includes
logistic training data.

29. The apparatus of claim 15 wherein the parameter
value includes a value of a logistic parameter.